Data Science Jobs and Salary Analysis and Visualization

We conducted a survey-based questionnaire among colleagues from work and university who hold data science and business analyst roles in different departments. We then appended the gathered responses to the original dataset and interpreted how our observations affected the original results. The workflow was: first, find a simple study based on small-sample survey data; reproduce the original findings using the original data; collect around 70 new observations; replicate the analysis on the extended sample; and finally present and discuss our findings.
Data Description
Importing Libraries
import warnings
warnings.filterwarnings('ignore')
pip install plotly==5.10.0
Requirement already satisfied: plotly==5.10.0 in c:\users\hp\anaconda3\lib\site-packages (5.10.0) Requirement already satisfied: tenacity>=6.2.0 in c:\users\hp\anaconda3\lib\site-packages (from plotly==5.10.0) (8.0.1) Note: you may need to restart the kernel to use updated packages.
%%capture
!pip install country_converter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import country_converter as coco
Loading Data
data = pd.read_csv("C:\\Users\\HP\\Desktop\\DS_Salaries\\ds_salaries.csv")
data.head()
| | Unnamed: 0 | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | 0 | DE | L |
| 1 | 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | 0 | JP | S |
| 2 | 2 | 2020 | SE | FT | Big Data Engineer | 85000 | GBP | 109024 | GB | 50 | GB | M |
| 3 | 3 | 2020 | MI | FT | Product Data Analyst | 20000 | USD | 20000 | HN | 0 | HN | S |
| 4 | 4 | 2020 | SE | FT | Machine Learning Engineer | 150000 | USD | 150000 | US | 50 | US | L |
data1 = pd.read_csv("C:\\Users\\HP\\Desktop\\DS_Salaries\\ds_salaries_observations_added.csv")
data1.head()
| | Unnamed: 0 | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | Unnamed: 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 2020.0 | MI | FT | Data Scientist | 70000.0 | EUR | 79833.0 | DE | 0.0 | DE | L | NaN |
| 1 | 1.0 | 2020.0 | SE | FT | Machine Learning Scientist | 260000.0 | USD | 260000.0 | JP | 0.0 | JP | S | NaN |
| 2 | 2.0 | 2020.0 | SE | FT | Big Data Engineer | 85000.0 | GBP | 109024.0 | GB | 50.0 | GB | M | NaN |
| 3 | 3.0 | 2020.0 | MI | FT | Product Data Analyst | 20000.0 | USD | 20000.0 | HN | 0.0 | HN | S | NaN |
| 4 | 4.0 | 2020.0 | SE | FT | Machine Learning Engineer | 150000.0 | USD | 150000.0 | US | 50.0 | US | L | NaN |
data.columns #for original data
Index(['Unnamed: 0', 'work_year', 'experience_level', 'employment_type',
'job_title', 'salary', 'salary_currency', 'salary_in_usd',
'employee_residence', 'remote_ratio', 'company_location',
'company_size'],
dtype='object')
data1.columns #observations added data
Index(['Unnamed: 0', 'work_year', 'experience_level', 'employment_type',
'job_title', 'salary', 'salary_currency', 'salary_in_usd',
'employee_residence', 'remote_ratio', 'company_location',
'company_size', 'Unnamed: 12'],
dtype='object')
data["job_title"].nunique()
50
data1["job_title"].nunique()
50
data["job_title"].value_counts(sort=True)
Data Scientist 143 Data Engineer 132 Data Analyst 97 Machine Learning Engineer 41 Research Scientist 16 Data Science Manager 12 Data Architect 11 Big Data Engineer 8 Machine Learning Scientist 8 Principal Data Scientist 7 Director of Data Science 7 Data Science Consultant 7 Data Analytics Manager 7 AI Scientist 7 ML Engineer 6 BI Data Analyst 6 Lead Data Engineer 6 Computer Vision Engineer 6 Applied Data Scientist 5 Head of Data 5 Data Engineering Manager 5 Business Data Analyst 5 Head of Data Science 4 Applied Machine Learning Scientist 4 Data Analytics Engineer 4 Analytics Engineer 4 Principal Data Engineer 3 Computer Vision Software Engineer 3 Data Science Engineer 3 Lead Data Scientist 3 Machine Learning Developer 3 Lead Data Analyst 3 Machine Learning Infrastructure Engineer 3 Cloud Data Engineer 2 Product Data Analyst 2 Financial Data Analyst 2 Principal Data Analyst 2 Director of Data Engineering 2 ETL Developer 2 Staff Data Scientist 1 Lead Machine Learning Engineer 1 Marketing Data Analyst 1 Data Specialist 1 Head of Machine Learning 1 Data Analytics Lead 1 Finance Data Analyst 1 Big Data Architect 1 Machine Learning Manager 1 3D Computer Vision Researcher 1 NLP Engineer 1 Name: job_title, dtype: int64
data1["job_title"].value_counts(sort=True)
Data Scientist 154 Data Engineer 137 Data Analyst 100 Machine Learning Engineer 44 Research Scientist 19 Data Science Manager 15 BI Data Analyst 11 Data Architect 11 ML Engineer 10 AI Scientist 10 Big Data Engineer 9 Business Data Analyst 8 Machine Learning Scientist 8 Data Analytics Engineer 8 Principal Data Scientist 7 Data Science Consultant 7 Data Analytics Manager 7 Director of Data Science 7 Analytics Engineer 7 Computer Vision Engineer 6 Machine Learning Developer 6 Applied Data Scientist 6 Lead Data Engineer 6 Cloud Data Engineer 5 Head of Data 5 Data Engineering Manager 5 Applied Machine Learning Scientist 4 Big Data Architect 4 Financial Data Analyst 4 Head of Data Science 4 Principal Data Engineer 3 Data Science Engineer 3 Data Specialist 3 Machine Learning Infrastructure Engineer 3 Computer Vision Software Engineer 3 Lead Data Scientist 3 Lead Data Analyst 3 Marketing Data Analyst 3 ETL Developer 2 Principal Data Analyst 2 Director of Data Engineering 2 Product Data Analyst 2 Lead Machine Learning Engineer 1 Data Analytics Lead 1 Machine Learning Manager 1 Finance Data Analyst 1 NLP Engineer 1 Head of Machine Learning 1 3D Computer Vision Researcher 1 Staff Data Scientist 1 Name: job_title, dtype: int64
data.drop(['Unnamed: 0', 'salary', 'salary_currency'], axis=1, inplace=True) #Original data
data1.drop(['Unnamed: 0', 'salary', 'salary_currency'], axis=1, inplace=True) #Observations added
data.info() #Original
<class 'pandas.core.frame.DataFrame'> RangeIndex: 607 entries, 0 to 606 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 work_year 607 non-null int64 1 experience_level 607 non-null object 2 employment_type 607 non-null object 3 job_title 607 non-null object 4 salary_in_usd 607 non-null int64 5 employee_residence 607 non-null object 6 remote_ratio 607 non-null int64 7 company_location 607 non-null object 8 company_size 607 non-null object dtypes: int64(3), object(6) memory usage: 42.8+ KB
data1.info() #Observations added
<class 'pandas.core.frame.DataFrame'> RangeIndex: 682 entries, 0 to 681 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 work_year 674 non-null float64 1 experience_level 674 non-null object 2 employment_type 674 non-null object 3 job_title 674 non-null object 4 salary_in_usd 674 non-null float64 5 employee_residence 674 non-null object 6 remote_ratio 674 non-null float64 7 company_location 674 non-null object 8 company_size 674 non-null object 9 Unnamed: 12 0 non-null float64 dtypes: float64(4), object(6) memory usage: 53.4+ KB
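The `info()` output above reveals two artifacts in the extended file: an all-null `Unnamed: 12` column and 682 − 674 = 8 fully blank rows, likely introduced by trailing commas and empty lines while the new observations were appended in a spreadsheet. A minimal cleanup sketch, shown on a small stand-in frame since the CSV itself is local to the author's machine:

```python
import numpy as np
import pandas as pd

# Stand-in for data1: two real rows, one fully blank row, one all-null column.
df = pd.DataFrame({
    "work_year": [2020.0, 2021.0, np.nan],
    "job_title": ["Data Scientist", "Data Engineer", np.nan],
    "Unnamed: 12": [np.nan, np.nan, np.nan],
})

# Drop columns that are entirely null (e.g. 'Unnamed: 12'),
# then rows that are entirely null.
df = df.dropna(axis=1, how="all").dropna(axis=0, how="all").reset_index(drop=True)
print(df.shape)  # (2, 2)
```

Doing this before the label-recoding step below also prevents the blank rows from being silently relabeled by the `else` branches of the recoding logic.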
# Replace the coded categories with readable labels for both frames.
# dict.get() with a default reproduces the original if/elif/else logic
# (anything unmapped falls through to the last label) without
# row-by-row chained assignment.
experience_labels = {'EN': 'Entry Level', 'MI': 'Mid Level', 'SE': 'Senior Level'}
employment_labels = {'FT': 'Full Time', 'PT': 'Part Time', 'CT': 'Contract'}
remote_labels = {0: 'No remote work', 50: 'Partially remote'}

for df in (data, data1):
    df['experience_level'] = df['experience_level'].apply(lambda x: experience_labels.get(x, 'Executive Level'))
    df['employment_type'] = df['employment_type'].apply(lambda x: employment_labels.get(x, 'Freelance'))
    df['remote_ratio'] = df['remote_ratio'].apply(lambda x: remote_labels.get(x, 'Fully remote'))
data1.head()
| | work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | Unnamed: 12 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020.0 | Mid Level | Full Time | Data Scientist | 79833.0 | DE | No remote work | DE | L | NaN |
| 1 | 2020.0 | Senior Level | Full Time | Machine Learning Scientist | 260000.0 | JP | No remote work | JP | S | NaN |
| 2 | 2020.0 | Senior Level | Full Time | Big Data Engineer | 109024.0 | GB | Partially remote | GB | M | NaN |
| 3 | 2020.0 | Mid Level | Full Time | Product Data Analyst | 20000.0 | HN | No remote work | HN | S | NaN |
| 4 | 2020.0 | Senior Level | Full Time | Machine Learning Engineer | 150000.0 | US | Partially remote | US | L | NaN |
num_cols = data.select_dtypes('number')
cat_cols = data.select_dtypes('O')
print(f"Numerical Columns: {num_cols.columns}")
print(f"Categorical Columns: {cat_cols.columns}")
Numerical Columns: Index(['work_year', 'salary_in_usd'], dtype='object')
Categorical Columns: Index(['experience_level', 'employment_type', 'job_title',
'employee_residence', 'remote_ratio', 'company_location',
'company_size'],
dtype='object')
num_cols1 = data1.select_dtypes('number')
cat_cols1 = data1.select_dtypes('O')
print(f"Numerical Columns: {num_cols1.columns}")
print(f"Categorical Columns: {cat_cols1.columns}")
Numerical Columns: Index(['work_year', 'salary_in_usd', 'Unnamed: 12'], dtype='object')
Categorical Columns: Index(['experience_level', 'employment_type', 'job_title',
'employee_residence', 'remote_ratio', 'company_location',
'company_size'],
dtype='object')
Unique Values in each Column
print(f"work_year: {num_cols['work_year'].unique()}")
print('\n*********************************************************************************************\n')
for c in range(cat_cols.shape[1]):
print(f"{cat_cols.columns[c]}: {cat_cols.iloc[:, c].unique()}")
print('\n*********************************************************************************************\n')
work_year: [2020 2021 2022] ********************************************************************************************* experience_level: ['Mid Level' 'Senior Level' 'Entry Level' 'Executive Level'] ********************************************************************************************* employment_type: ['Full Time' 'Contract' 'Part Time' 'Freelance'] ********************************************************************************************* job_title: ['Data Scientist' 'Machine Learning Scientist' 'Big Data Engineer' 'Product Data Analyst' 'Machine Learning Engineer' 'Data Analyst' 'Lead Data Scientist' 'Business Data Analyst' 'Lead Data Engineer' 'Lead Data Analyst' 'Data Engineer' 'Data Science Consultant' 'BI Data Analyst' 'Director of Data Science' 'Research Scientist' 'Machine Learning Manager' 'Data Engineering Manager' 'Machine Learning Infrastructure Engineer' 'ML Engineer' 'AI Scientist' 'Computer Vision Engineer' 'Principal Data Scientist' 'Data Science Manager' 'Head of Data' '3D Computer Vision Researcher' 'Data Analytics Engineer' 'Applied Data Scientist' 'Marketing Data Analyst' 'Cloud Data Engineer' 'Financial Data Analyst' 'Computer Vision Software Engineer' 'Director of Data Engineering' 'Data Science Engineer' 'Principal Data Engineer' 'Machine Learning Developer' 'Applied Machine Learning Scientist' 'Data Analytics Manager' 'Head of Data Science' 'Data Specialist' 'Data Architect' 'Finance Data Analyst' 'Principal Data Analyst' 'Big Data Architect' 'Staff Data Scientist' 'Analytics Engineer' 'ETL Developer' 'Head of Machine Learning' 'NLP Engineer' 'Lead Machine Learning Engineer' 'Data Analytics Lead'] ********************************************************************************************* employee_residence: ['DE' 'JP' 'GB' 'HN' 'US' 'HU' 'NZ' 'FR' 'IN' 'PK' 'PL' 'PT' 'CN' 'GR' 'AE' 'NL' 'MX' 'CA' 'AT' 'NG' 'PH' 'ES' 'DK' 'RU' 'IT' 'HR' 'BG' 'SG' 'BR' 'IQ' 'VN' 'BE' 'UA' 'MT' 'CL' 'RO' 'IR' 'CO' 'MD' 'KE' 'SI' 'HK' 'TR' 'RS' 'PR' 
'LU' 'JE' 'CZ' 'AR' 'DZ' 'TN' 'MY' 'EE' 'AU' 'BO' 'IE' 'CH'] ********************************************************************************************* remote_ratio: ['No remote work' 'Partially remote' 'Fully remote'] ********************************************************************************************* company_location: ['DE' 'JP' 'GB' 'HN' 'US' 'HU' 'NZ' 'FR' 'IN' 'PK' 'CN' 'GR' 'AE' 'NL' 'MX' 'CA' 'AT' 'NG' 'ES' 'PT' 'DK' 'IT' 'HR' 'LU' 'PL' 'SG' 'RO' 'IQ' 'BR' 'BE' 'UA' 'IL' 'RU' 'MT' 'CL' 'IR' 'CO' 'MD' 'KE' 'SI' 'CH' 'VN' 'AS' 'TR' 'CZ' 'DZ' 'EE' 'MY' 'AU' 'IE'] ********************************************************************************************* company_size: ['L' 'S' 'M'] *********************************************************************************************
print(f"work_year: {num_cols1['work_year'].unique()}")
print('\n*********************************************************************************************\n')
for c in range(cat_cols1.shape[1]):
print(f"{cat_cols1.columns[c]}: {cat_cols1.iloc[:, c].unique()}")
print('\n*********************************************************************************************\n')
work_year: [2020. 2021. 2022. nan] ********************************************************************************************* experience_level: ['Mid Level' 'Senior Level' 'Entry Level' 'Executive Level'] ********************************************************************************************* employment_type: ['Full Time' 'Contract' 'Part Time' 'Freelance'] ********************************************************************************************* job_title: ['Data Scientist' 'Machine Learning Scientist' 'Big Data Engineer' 'Product Data Analyst' 'Machine Learning Engineer' 'Data Analyst' 'Lead Data Scientist' 'Business Data Analyst' 'Lead Data Engineer' 'Lead Data Analyst' 'Data Engineer' 'Data Science Consultant' 'BI Data Analyst' 'Director of Data Science' 'Research Scientist' 'Machine Learning Manager' 'Data Engineering Manager' 'Machine Learning Infrastructure Engineer' 'ML Engineer' 'AI Scientist' 'Computer Vision Engineer' 'Principal Data Scientist' 'Data Science Manager' 'Head of Data' '3D Computer Vision Researcher' 'Data Analytics Engineer' 'Applied Data Scientist' 'Marketing Data Analyst' 'Cloud Data Engineer' 'Financial Data Analyst' 'Computer Vision Software Engineer' 'Director of Data Engineering' 'Data Science Engineer' 'Principal Data Engineer' 'Machine Learning Developer' 'Applied Machine Learning Scientist' 'Data Analytics Manager' 'Head of Data Science' 'Data Specialist' 'Data Architect' 'Finance Data Analyst' 'Principal Data Analyst' 'Big Data Architect' 'Staff Data Scientist' 'Analytics Engineer' 'ETL Developer' 'Head of Machine Learning' 'NLP Engineer' 'Lead Machine Learning Engineer' 'Data Analytics Lead' nan] ********************************************************************************************* employee_residence: ['DE' 'JP' 'GB' 'HN' 'US' 'HU' 'NZ' 'FR' 'IN' 'PK' 'PL' 'PT' 'CN' 'GR' 'AE' 'NL' 'MX' 'CA' 'AT' 'NG' 'PH' 'ES' 'DK' 'RU' 'IT' 'HR' 'BG' 'SG' 'BR' 'IQ' 'VN' 'BE' 'UA' 'MT' 'CL' 'RO' 'IR' 'CO' 'MD' 'KE' 'SI' 'HK' 
'TR' 'RS' 'PR' 'LU' 'JE' 'CZ' 'AR' 'DZ' 'TN' 'MY' 'EE' 'AU' 'BO' 'IE' 'CH' nan] ********************************************************************************************* remote_ratio: ['No remote work' 'Partially remote' 'Fully remote'] ********************************************************************************************* company_location: ['DE' 'JP' 'GB' 'HN' 'US' 'HU' 'NZ' 'FR' 'IN' 'PK' 'CN' 'GR' 'AE' 'NL' 'MX' 'CA' 'AT' 'NG' 'ES' 'PT' 'DK' 'IT' 'HR' 'LU' 'PL' 'SG' 'RO' 'IQ' 'BR' 'BE' 'UA' 'IL' 'RU' 'MT' 'CL' 'IR' 'CO' 'MD' 'KE' 'SI' 'CH' 'VN' 'AS' 'TR' 'CZ' 'DZ' 'EE' 'MY' 'AU' 'IE' nan] ********************************************************************************************* company_size: ['L' 'S' 'M' nan] *********************************************************************************************
Observational Analysis
data.groupby("job_title").agg(minimum_salary_in_usd=("salary_in_usd","min"),maximum_salary_in_usd=("salary_in_usd","max"),average_salary_in_usd=("salary_in_usd","mean")).reset_index()
| | job_title | minimum_salary_in_usd | maximum_salary_in_usd | average_salary_in_usd |
|---|---|---|---|---|
| 0 | 3D Computer Vision Researcher | 5409 | 5409 | 5409.000000 |
| 1 | AI Scientist | 12000 | 200000 | 66135.571429 |
| 2 | Analytics Engineer | 135000 | 205300 | 175000.000000 |
| 3 | Applied Data Scientist | 54238 | 380000 | 175655.000000 |
| 4 | Applied Machine Learning Scientist | 31875 | 423000 | 142068.750000 |
| 5 | BI Data Analyst | 9272 | 150000 | 74755.166667 |
| 6 | Big Data Architect | 99703 | 99703 | 99703.000000 |
| 7 | Big Data Engineer | 5882 | 114047 | 51974.000000 |
| 8 | Business Data Analyst | 18442 | 135000 | 76691.200000 |
| 9 | Cloud Data Engineer | 89294 | 160000 | 124647.000000 |
| 10 | Computer Vision Engineer | 10000 | 125000 | 44419.333333 |
| 11 | Computer Vision Software Engineer | 70000 | 150000 | 105248.666667 |
| 12 | Data Analyst | 6072 | 200000 | 92893.061856 |
| 13 | Data Analytics Engineer | 20000 | 110000 | 64799.250000 |
| 14 | Data Analytics Lead | 405000 | 405000 | 405000.000000 |
| 15 | Data Analytics Manager | 105400 | 150260 | 127134.285714 |
| 16 | Data Architect | 90700 | 266400 | 177873.909091 |
| 17 | Data Engineer | 4000 | 324000 | 112725.000000 |
| 18 | Data Engineering Manager | 59303 | 174000 | 123227.200000 |
| 19 | Data Science Consultant | 5707 | 103000 | 69420.714286 |
| 20 | Data Science Engineer | 40189 | 127221 | 75803.333333 |
| 21 | Data Science Manager | 54094 | 241000 | 158328.500000 |
| 22 | Data Scientist | 2859 | 412000 | 108187.832168 |
| 23 | Data Specialist | 165000 | 165000 | 165000.000000 |
| 24 | Director of Data Engineering | 113476 | 200000 | 156738.000000 |
| 25 | Director of Data Science | 130026 | 325000 | 195074.000000 |
| 26 | ETL Developer | 54957 | 54957 | 54957.000000 |
| 27 | Finance Data Analyst | 61896 | 61896 | 61896.000000 |
| 28 | Financial Data Analyst | 100000 | 450000 | 275000.000000 |
| 29 | Head of Data | 32974 | 235000 | 160162.600000 |
| 30 | Head of Data Science | 85000 | 224000 | 146718.750000 |
| 31 | Head of Machine Learning | 79039 | 79039 | 79039.000000 |
| 32 | Lead Data Analyst | 19609 | 170000 | 92203.000000 |
| 33 | Lead Data Engineer | 56000 | 276000 | 139724.500000 |
| 34 | Lead Data Scientist | 40570 | 190000 | 115190.000000 |
| 35 | Lead Machine Learning Engineer | 87932 | 87932 | 87932.000000 |
| 36 | ML Engineer | 15966 | 270000 | 117504.000000 |
| 37 | Machine Learning Developer | 78791 | 100000 | 85860.666667 |
| 38 | Machine Learning Engineer | 20000 | 250000 | 104880.146341 |
| 39 | Machine Learning Infrastructure Engineer | 50180 | 195000 | 101145.000000 |
| 40 | Machine Learning Manager | 117104 | 117104 | 117104.000000 |
| 41 | Machine Learning Scientist | 12000 | 260000 | 158412.500000 |
| 42 | Marketing Data Analyst | 88654 | 88654 | 88654.000000 |
| 43 | NLP Engineer | 37236 | 37236 | 37236.000000 |
| 44 | Principal Data Analyst | 75000 | 170000 | 122500.000000 |
| 45 | Principal Data Engineer | 185000 | 600000 | 328333.333333 |
| 46 | Principal Data Scientist | 148261 | 416000 | 215242.428571 |
| 47 | Product Data Analyst | 6072 | 20000 | 13036.000000 |
| 48 | Research Scientist | 42000 | 450000 | 109019.500000 |
| 49 | Staff Data Scientist | 105000 | 105000 | 105000.000000 |
data1.groupby("job_title").agg(minimum_salary_in_usd=("salary_in_usd","min"),maximum_salary_in_usd=("salary_in_usd","max"),average_salary_in_usd=("salary_in_usd","mean")).reset_index()
| | job_title | minimum_salary_in_usd | maximum_salary_in_usd | average_salary_in_usd |
|---|---|---|---|---|
| 0 | 3D Computer Vision Researcher | 5409.0 | 5409.0 | 5409.000000 |
| 1 | AI Scientist | 12000.0 | 200000.0 | 59902.900000 |
| 2 | Analytics Engineer | 30240.0 | 205300.0 | 117640.000000 |
| 3 | Applied Data Scientist | 54238.0 | 380000.0 | 156459.166667 |
| 4 | Applied Machine Learning Scientist | 31875.0 | 423000.0 | 142068.750000 |
| 5 | BI Data Analyst | 9272.0 | 150000.0 | 60019.181818 |
| 6 | Big Data Architect | 40320.0 | 99703.0 | 61465.750000 |
| 7 | Big Data Engineer | 5882.0 | 114047.0 | 50399.111111 |
| 8 | Business Data Analyst | 18442.0 | 135000.0 | 61792.000000 |
| 9 | Cloud Data Engineer | 22680.0 | 160000.0 | 69010.800000 |
| 10 | Computer Vision Engineer | 10000.0 | 125000.0 | 44419.333333 |
| 11 | Computer Vision Software Engineer | 70000.0 | 150000.0 | 105248.666667 |
| 12 | Data Analyst | 6072.0 | 200000.0 | 91366.270000 |
| 13 | Data Analytics Engineer | 20000.0 | 110000.0 | 55709.625000 |
| 14 | Data Analytics Lead | 405000.0 | 405000.0 | 405000.000000 |
| 15 | Data Analytics Manager | 105400.0 | 150260.0 | 127134.285714 |
| 16 | Data Architect | 90700.0 | 266400.0 | 177873.909091 |
| 17 | Data Engineer | 4000.0 | 324000.0 | 109898.540146 |
| 18 | Data Engineering Manager | 59303.0 | 174000.0 | 123227.200000 |
| 19 | Data Science Consultant | 5707.0 | 103000.0 | 69420.714286 |
| 20 | Data Science Engineer | 40189.0 | 127221.0 | 75803.333333 |
| 21 | Data Science Manager | 32760.0 | 241000.0 | 136406.800000 |
| 22 | Data Scientist | 2859.0 | 412000.0 | 103577.402597 |
| 23 | Data Specialist | 35280.0 | 165000.0 | 82720.000000 |
| 24 | Director of Data Engineering | 113476.0 | 200000.0 | 156738.000000 |
| 25 | Director of Data Science | 130026.0 | 325000.0 | 195074.000000 |
| 26 | ETL Developer | 54957.0 | 54957.0 | 54957.000000 |
| 27 | Finance Data Analyst | 61896.0 | 61896.0 | 61896.000000 |
| 28 | Financial Data Analyst | 37800.0 | 450000.0 | 163330.000000 |
| 29 | Head of Data | 32974.0 | 235000.0 | 160162.600000 |
| 30 | Head of Data Science | 85000.0 | 224000.0 | 146718.750000 |
| 31 | Head of Machine Learning | 79039.0 | 79039.0 | 79039.000000 |
| 32 | Lead Data Analyst | 19609.0 | 170000.0 | 92203.000000 |
| 33 | Lead Data Engineer | 56000.0 | 276000.0 | 139724.500000 |
| 34 | Lead Data Scientist | 40570.0 | 190000.0 | 115190.000000 |
| 35 | Lead Machine Learning Engineer | 87932.0 | 87932.0 | 87932.000000 |
| 36 | ML Engineer | 15966.0 | 270000.0 | 89906.400000 |
| 37 | Machine Learning Developer | 32760.0 | 100000.0 | 66660.333333 |
| 38 | Machine Learning Engineer | 20000.0 | 250000.0 | 101451.954545 |
| 39 | Machine Learning Infrastructure Engineer | 50180.0 | 195000.0 | 101145.000000 |
| 40 | Machine Learning Manager | 117104.0 | 117104.0 | 117104.000000 |
| 41 | Machine Learning Scientist | 12000.0 | 260000.0 | 158412.500000 |
| 42 | Marketing Data Analyst | 35280.0 | 88654.0 | 55591.333333 |
| 43 | NLP Engineer | 37236.0 | 37236.0 | 37236.000000 |
| 44 | Principal Data Analyst | 75000.0 | 170000.0 | 122500.000000 |
| 45 | Principal Data Engineer | 185000.0 | 600000.0 | 328333.333333 |
| 46 | Principal Data Scientist | 148261.0 | 416000.0 | 215242.428571 |
| 47 | Product Data Analyst | 6072.0 | 20000.0 | 13036.000000 |
| 48 | Research Scientist | 25200.0 | 450000.0 | 96713.263158 |
| 49 | Staff Data Scientist | 105000.0 | 105000.0 | 105000.000000 |
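The grouped tables above list job titles alphabetically; sorting by the average makes the highest-paid titles easier to spot. A sketch of that variant on a toy stand-in frame (the column names match the notebook, the values are invented):

```python
import pandas as pd

# Toy stand-in for the salary data used in the notebook.
df = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Analyst", "Principal Data Engineer", "Data Scientist"],
    "salary_in_usd": [60000, 90000, 328000, 120000],
})

summary = (
    df.groupby("job_title")
      .agg(minimum_salary_in_usd=("salary_in_usd", "min"),
           maximum_salary_in_usd=("salary_in_usd", "max"),
           average_salary_in_usd=("salary_in_usd", "mean"))
      .sort_values("average_salary_in_usd", ascending=False)  # highest-paid titles first
      .reset_index()
)
print(summary.iloc[0]["job_title"])  # Principal Data Engineer
```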
data["company_location"].value_counts(sort=True)
US 355 GB 47 CA 30 DE 28 IN 24 FR 15 ES 14 GR 11 JP 6 NL 4 PL 4 AT 4 PT 4 LU 3 PK 3 TR 3 AE 3 BR 3 MX 3 DK 3 AU 3 CH 2 SI 2 NG 2 CN 2 BE 2 RU 2 CZ 2 IT 2 UA 1 DZ 1 VN 1 CO 1 RO 1 NZ 1 EE 1 AS 1 HU 1 KE 1 MY 1 HR 1 CL 1 SG 1 IL 1 HN 1 MD 1 IR 1 MT 1 IE 1 IQ 1 Name: company_location, dtype: int64
data1["company_location"].value_counts(sort=True)
US 355 PL 71 GB 47 CA 30 DE 28 IN 24 FR 15 ES 14 GR 11 JP 6 NL 4 AT 4 PT 4 BR 3 AE 3 LU 3 TR 3 MX 3 PK 3 DK 3 AU 3 CN 2 BE 2 RU 2 SI 2 CH 2 CZ 2 NG 2 IT 2 AS 1 DZ 1 UA 1 VN 1 RO 1 NZ 1 KE 1 SG 1 IQ 1 HU 1 MY 1 IL 1 MD 1 MT 1 IE 1 HN 1 IR 1 EE 1 CL 1 HR 1 CO 1 Name: company_location, dtype: int64
data["work_year"].value_counts()
2022 318 2021 217 2020 72 Name: work_year, dtype: int64
data1["work_year"].value_counts()
2022.0 344 2021.0 239 2020.0 91 Name: work_year, dtype: int64
data["experience_level"].value_counts()
Senior Level 280 Mid Level 213 Entry Level 88 Executive Level 26 Name: experience_level, dtype: int64
data1["experience_level"].value_counts()
Senior Level 280 Mid Level 245 Entry Level 88 Executive Level 69 Name: experience_level, dtype: int64
data["company_size"].value_counts()
M 326 L 198 S 83 Name: company_size, dtype: int64
data1["company_size"].value_counts()
M 340 L 225 S 109 Name: company_size, dtype: int64
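The paired `value_counts()` calls above are easier to compare when the two samples' category shares sit side by side in one frame. A sketch of that comparison, using small toy series as stand-ins for the original and extended columns:

```python
import pandas as pd

# Toy stand-ins for the experience_level columns of the two samples.
original = pd.Series(["Senior Level"] * 28 + ["Mid Level"] * 21 + ["Entry Level"] * 9)
extended = pd.Series(["Senior Level"] * 28 + ["Mid Level"] * 25 + ["Entry Level"] * 9)

# Align the normalized share of each category across the two samples.
comparison = pd.concat(
    {
        "original": original.value_counts(normalize=True),
        "extended": extended.value_counts(normalize=True),
    },
    axis=1,
).fillna(0.0)
comparison["shift"] = comparison["extended"] - comparison["original"]
print(comparison.round(3))
```

The `shift` column makes the effect of the added observations explicit, which is exactly the comparison the discussion sections below perform by eye.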
Univariate Analysis
fig = make_subplots(
rows=5, cols=3,
subplot_titles=("Experience Level", "Job Titles", "Employment Type", "Salary in USD", "Company Size", "Employee Residence", "Remote Ratio", "Company Location"),
horizontal_spacing= 0.05,
vertical_spacing=0.05,
specs=[[{'type':'domain'}, None, {"rowspan": 2}],
[{}, None, None],
[{"colspan": 3}, None, None],
[{'type':'domain'}, {"colspan": 2}, None],
[{'type':'domain'}, {"colspan": 2}, None]],
)
fig.add_trace(go.Pie(labels=data['experience_level'].value_counts().index, values=data['experience_level'].value_counts(), hole=0.45),
row=1, col=1)
fig.add_trace(go.Bar(x=data['employment_type'].value_counts().sort_values(ascending=False).index, y=data['employment_type'].value_counts().sort_values(ascending=False)),
row=2, col=1)
fig.add_trace(go.Bar(x=data['job_title'].value_counts().sort_values(ascending=True), y=data['job_title'].value_counts().sort_values(ascending=True).index, orientation='h'),
row=1, col=3)
fig.add_trace(go.Histogram(x=data['salary_in_usd']),
row=3, col=1)
fig.add_trace(go.Bar(x=data['employee_residence'].value_counts().sort_values(ascending=False).index, y=data['employee_residence'].value_counts().sort_values(ascending=False)),
row=4, col=2)
fig.add_trace(go.Bar(x=data['company_location'].value_counts().sort_values(ascending=False).index, y=data['company_location'].value_counts().sort_values(ascending=False)),
row=5, col=2)
fig.add_trace(go.Pie(labels=data['company_size'].value_counts().index, values=data['company_size'].value_counts(), hole=0.65),
row=4, col=1)
fig.add_trace(go.Pie(labels=data['remote_ratio'].value_counts().index, values=data['remote_ratio'].value_counts(), hole=0.65),
row=5, col=1)
fig.update_xaxes(color='white')
fig.update_yaxes(color='white')
fig.update_layout(height=2000, width=1064, margin=dict(l=40, r=20, b=50, t=50), showlegend=False)
fig.layout.template='plotly_dark'
fig.show()
From the above plot we can learn that:

- Most respondents work at mid level (35%) or senior level (46%).
- The top 5 job titles are Data Scientist, Data Engineer, Data Analyst, Machine Learning Engineer and Research Scientist.
- Most employees hold full-time jobs, while freelancing is the least common employment type.
- The bulk of salaries lies between 40K and 160K USD.
- Most companies are mid-sized, and fully remote is the most common work arrangement.
- Most companies are located in the USA, and most employees also reside in the USA.
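The percentages and salary range quoted above can be read straight off the recoded frame rather than off the chart. A sketch of that computation on a toy stand-in frame (column names from the notebook, values invented):

```python
import pandas as pd

# Toy stand-in for the recoded data frame.
df = pd.DataFrame({
    "experience_level": ["Senior Level"] * 46 + ["Mid Level"] * 35 + ["Entry Level"] * 19,
    "salary_in_usd": list(range(40000, 240000, 2000)),  # 100 synthetic salaries
})

# Share of each experience level, as plotted in the donut chart.
shares = df["experience_level"].value_counts(normalize=True)
print((shares * 100).round(1))

# Quartiles locate the bulk of the salary distribution shown in the histogram.
print(df["salary_in_usd"].quantile([0.25, 0.5, 0.75]))
```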
fig = make_subplots(
rows=5, cols=3,
subplot_titles=("Experience Level", "Job Titles", "Employment Type", "Salary in USD", "Company Size", "Employee Residence", "Remote Ratio", "Company Location"),
horizontal_spacing= 0.05,
vertical_spacing=0.05,
specs=[[{'type':'domain'}, None, {"rowspan": 2}],
[{}, None, None],
[{"colspan": 3}, None, None],
[{'type':'domain'}, {"colspan": 2}, None],
[{'type':'domain'}, {"colspan": 2}, None]],
)
fig.add_trace(go.Pie(labels=data1['experience_level'].value_counts().index, values=data1['experience_level'].value_counts(), hole=0.45),
row=1, col=1)
fig.add_trace(go.Bar(x=data1['employment_type'].value_counts().sort_values(ascending=False).index, y=data1['employment_type'].value_counts().sort_values(ascending=False)),
row=2, col=1)
fig.add_trace(go.Bar(x=data1['job_title'].value_counts().sort_values(ascending=True), y=data1['job_title'].value_counts().sort_values(ascending=True).index, orientation='h'),
row=1, col=3)
fig.add_trace(go.Histogram(x=data1['salary_in_usd']),
row=3, col=1)
fig.add_trace(go.Bar(x=data1['employee_residence'].value_counts().sort_values(ascending=False).index, y=data1['employee_residence'].value_counts().sort_values(ascending=False)),
row=4, col=2)
fig.add_trace(go.Bar(x=data1['company_location'].value_counts().sort_values(ascending=False).index, y=data1['company_location'].value_counts().sort_values(ascending=False)),
row=5, col=2)
fig.add_trace(go.Pie(labels=data1['company_size'].value_counts().index, values=data1['company_size'].value_counts(), hole=0.65),
row=4, col=1)
fig.add_trace(go.Pie(labels=data1['remote_ratio'].value_counts().index, values=data1['remote_ratio'].value_counts(), hole=0.65),
row=5, col=1)
fig.update_xaxes(color='white')
fig.update_yaxes(color='white')
fig.update_layout(height=2000, width=1064, margin=dict(l=40, r=20, b=50, t=50), showlegend=False)
fig.layout.template='plotly_dark'
fig.show()
From the above plot we can learn that:

- Most respondents work at mid level (35.9%) or senior level (41.1%). In Poland these roles are still considered a relatively new type of work, so positions are mostly entry-level or junior; this shows up in the plot as an increase in the entry- and mid-level shares and a decrease in the senior-level share.
- The top 5 job titles are still Data Scientist, Data Engineer, Data Analyst, Machine Learning Engineer and Research Scientist.
- Most employees hold full-time jobs, but contract work is now the least common type, whereas in the original data freelance was the least common. Freelancing is popular in Poland, which lifts its count above part-time and contract work here.
- The bulk of salaries lies between 40K and 140K USD.
- Most companies are mid-sized, and fully remote is the most common work arrangement. Compared with the original data, small and large companies are slightly more frequent, reflecting their higher share among the Polish observations.
- Most companies are located in the USA, and most employees also reside in the USA. Poland now ranks 2nd for employee residence with a count of 71, ahead of GB in 3rd place with 47.
Violin Distribution Plots
Experience Level vs Salary in USD
fig = px.violin(data, x='salary_in_usd', y='experience_level', color='experience_level', height=500, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(yaxis={'categoryorder':'total ascending'})
fig = px.violin(data1, x='salary_in_usd', y='experience_level', color='experience_level', height=500, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(yaxis={'categoryorder':'total ascending'})
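The violin plots compare the salary distribution per experience level visually; the same comparison in numbers is a one-line groupby, sketched here on a toy stand-in frame (values invented):

```python
import pandas as pd

# Toy stand-in for the recoded salary data.
df = pd.DataFrame({
    "experience_level": ["Entry Level", "Entry Level", "Senior Level", "Senior Level", "Executive Level"],
    "salary_in_usd": [40000, 60000, 120000, 180000, 250000],
})

# Median, mean and sample size per level, ordered like the plot's y-axis.
stats = (
    df.groupby("experience_level")["salary_in_usd"]
      .agg(["median", "mean", "count"])
      .sort_values("median")
)
print(stats)
```

The `count` column is worth printing alongside the medians, since several levels (e.g. Executive Level) have few observations and their violins are correspondingly unreliable.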
Job Titles in the years 2020, 2021 and 2022
fig = px.histogram(data, x='job_title', color='work_year', height=800, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})
fig = px.histogram(data1, x='job_title', color='work_year', height=800, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})
Number of Job Roles by Company Size
fig = px.histogram(data, x='job_title', color='company_size', height=800, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig = px.histogram(data1, x='job_title', color='company_size', height=800, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
Job Title by Salary with Experience Level
fig = px.scatter(data, x='salary_in_usd', y='job_title', color='experience_level', height=1000, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(yaxis={'categoryorder':'total ascending'})
fig = px.scatter(data1, x='salary_in_usd', y='job_title', color='experience_level', height=1000, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(yaxis={'categoryorder':'total ascending'})
Job Title by Salary with Company Size
fig = px.scatter(data, x='salary_in_usd', y='job_title', color='company_size', height=1000, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(yaxis={'categoryorder':'total ascending'})
fig = px.scatter(data1, x='salary_in_usd', y='job_title', color='company_size', height=1000, width=1000)
fig.layout.template='plotly_dark'
fig.update_layout(yaxis={'categoryorder':'total ascending'})
CONCLUSION

Extending the sample with roughly 70 new observations, most of them from Poland, preserved the headline findings: the same top five job titles, the dominance of US-based companies and residents, and the prevalence of full-time employment. At the same time, the extension shifted the experience-level mix toward entry and mid level, made freelance work more visible, and lowered the upper end of the typical salary range from about 160K to about 140K USD.

Thank You!!